3.4 Bayes Estimation for Bayesians

1 Interpretations of Probability

What does "probability" mean in the real world?

2 Where Does the Prior Λ Come From?

2.1 Subjective Beliefs

Subjective beliefs bring all relevant information to bear and give the posterior a straightforward interpretation. But the posterior is therefore subjective.

2.2 "Objective" or "Vague" Prior

Using a default prior removes subjectivity, e.g. the flat prior $\lambda(\theta) \propto 1$ on $\Theta$.

We also have Jeffreys' prior $\lambda(\theta) \propto |J(\theta)|^{1/2}$ (recall the Fisher information). Recall that $D_{KL}(p_\theta \| p_{\theta'}) \approx \frac{|\theta' - \theta|^2 J(\theta)}{2}$. Then when $\varepsilon$ is small,
$$\Lambda([\theta, \theta + \varepsilon]) \approx \varepsilon \lambda(\theta) \propto \sqrt{2\, D_{KL}(p_\theta \| p_{\theta + \varepsilon})}.$$
So $\Lambda$ has higher density where $p_\theta$ is "changing faster".

Take coin tossing as an example:
![[Pasted image 20241208201515.png|400]]
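For the Bernoulli (coin-tossing) model the Fisher information is $J(\theta) = \frac{1}{\theta(1-\theta)}$, so the Jeffreys prior is $\propto \theta^{-1/2}(1-\theta)^{-1/2}$, the Beta(1/2, 1/2) density. A minimal numerical sketch (plain Python, values chosen for illustration) checking that $2\, D_{KL}(p_\theta \| p_{\theta+\varepsilon})/\varepsilon^2 \approx J(\theta)$ and that the prior piles mass where $p_\theta$ changes fastest:

```python
import math

def fisher_info(theta):
    # Fisher information for Bernoulli(theta)
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_unnormalized(theta):
    # Jeffreys prior: sqrt(J(theta)) = theta^{-1/2} (1-theta)^{-1/2}
    return math.sqrt(fisher_info(theta))

def kl_bern(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Small-epsilon check: D_KL(p_theta || p_{theta+eps}) ~ eps^2 J(theta) / 2
theta, eps = 0.3, 1e-4
assert abs(2 * kl_bern(theta, theta + eps) / eps**2 - fisher_info(theta)) < 1e-2

# The prior has more mass near 0 and 1, where p_theta changes fastest
assert jeffreys_unnormalized(0.01) > jeffreys_unnormalized(0.5)
```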

2.2.1 Intersubjective Agreement

The data may effectively rule out most values of θ, making the posterior uncontroversial regardless of the prior.

2.2.2 Gaussian Sequence Model

Let $X \mid \mu \sim N_d(\mu, I_d)$, $\mu \in \mathbb{R}^d$. The Jeffreys prior is flat: $\lambda(\mu) \propto 1$.[1]

So $\mu \mid X \sim N_d(X, I_d)$, hence $E[\mu \mid X] = X$. This is the same as the UMVUE.

For $p^2 = \|\mu\|^2$: recall $\mu \mid X \sim N_d(X, I_d)$, so $E[\|\mu\|^2 \mid X] = \|X\|^2 + d$, and note that $\delta_{UMVU}(X) = \|X\|^2 - d$, so $\delta_\Lambda(X) = \delta_{UMVU}(X) + 2d$.
Now apply the bias-variance tradeoff: $\text{MSE}(\theta; \delta_\Lambda) = \text{Var}_\theta(\delta_\Lambda) + \text{Bias}_\theta(\delta_\Lambda)^2 = \text{Var}_\theta(\delta_{UMVU}) + 4d^2$.
Examine the Jeffreys prior again: $P(p^2 \le t) = \text{Vol}(\text{ball of radius } \sqrt{t}) = \text{const}(d)\, t^{d/2}$, so $\lambda(p^2) \propto_{p^2} (p^2)^{d/2 - 1} = p^{d-2}$, which grows rapidly. This shows the prior "expects" $p^2$ to be large.
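A quick simulation (with hypothetical choices of $d$ and $\mu$) illustrating that $\delta_{UMVU}$ is unbiased for $\|\mu\|^2$ while the flat-prior Bayes estimator is biased upward by $2d$:

```python
import random

random.seed(0)
d = 10
mu = [1.0] * d              # true mean vector, ||mu||^2 = 10 (illustrative choice)
n_sims = 20000

def sq_norm(v):
    return sum(x * x for x in v)

# Draw X ~ N_d(mu, I_d) repeatedly and average the two estimators of ||mu||^2.
umvue_sum, bayes_sum = 0.0, 0.0
for _ in range(n_sims):
    x = [m + random.gauss(0.0, 1.0) for m in mu]
    s = sq_norm(x)
    umvue_sum += s - d      # delta_UMVU(X) = ||X||^2 - d
    bayes_sum += s + d      # delta_Lambda(X) = ||X||^2 + d (flat-prior posterior mean)

umvue_mean = umvue_sum / n_sims
bayes_mean = bayes_sum / n_sims
true_val = sq_norm(mu)      # = 10

# UMVUE is (approximately) unbiased; the Bayes estimator overshoots by 2d = 20.
assert abs(umvue_mean - true_val) < 0.5
assert abs(bayes_mean - (true_val + 2 * d)) < 0.5
```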

2.3 Prior or Concurrent Experience

2.3.1 Flexibility of Bayes

Given any $\Lambda, P, L, g(\theta)$, the Bayes estimator $\delta_\Lambda$ is defined straightforwardly by $\delta_\Lambda(x) = \arg\min_d \int L(\theta, d)\, \lambda(\theta \mid x)\, d\theta$. So the problem is reduced to (possibly hard) computation, and the posterior is a "one stop shop" for all answers: there is no need for a separate derivation for each new loss or estimand.

So Bayes is a highly expressive framework for modeling and estimation. (The caveat is that it is limited by our ability to do the computations.)
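A minimal sketch of the "one stop shop" idea: once we have (samples from) the posterior, every Bayes estimate is just a minimization of posterior expected loss. For squared error the argmin is the posterior mean; for absolute error, the posterior median. The posterior draws below are hypothetical:

```python
import statistics

# Hypothetical draws from some posterior lambda(theta | x)
posterior_samples = [0.1, 0.4, 0.5, 0.9, 2.0]

bayes_sq = statistics.mean(posterior_samples)     # argmin_d E[(theta - d)^2 | x]
bayes_abs = statistics.median(posterior_samples)  # argmin_d E[|theta - d| | x]

# Different losses, same posterior -- no new derivation needed per loss.
assert abs(bayes_sq - 0.78) < 1e-9
assert bayes_abs == 0.5
```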

2.4 Convenience Priors

We can choose conjugate or other "nice" priors so that computations are much faster, especially in high dimensions.
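A sketch of why conjugacy makes computation cheap, using the classic Beta-Bernoulli pair (the specific counts are illustrative): the posterior update is pure arithmetic on the hyperparameters, with no integration.

```python
# Beta(a, b) prior on a Bernoulli success probability is conjugate:
# after k successes and n - k failures, the posterior is Beta(a + k, b + n - k).

def beta_bernoulli_update(a, b, successes, failures):
    return a + successes, b + failures

# Starting from the Jeffreys prior Beta(1/2, 1/2), observe 7 heads and 3 tails:
a, b = beta_bernoulli_update(0.5, 0.5, successes=7, failures=3)
posterior_mean = a / (a + b)        # (0.5 + 7) / (1 + 10) = 7.5 / 11

assert (a, b) == (7.5, 3.5)
assert abs(posterior_mean - 7.5 / 11) < 1e-12
```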

3 Hierarchical Bayes

The full power of Bayes is realized in large, complex problems with repeated structure, allowing us to pool information across many observations.

3.1 Gaussian Hierarchical Model

Now suppose $\tau^2 \sim \lambda_0$, $\theta_i \mid \tau^2 \overset{i.i.d.}{\sim} N(0, \tau^2)$ for $i \le d$, and $X_i \mid \tau^2, \theta_i \overset{ind.}{\sim} N(\theta_i, 1)$.
Calculate the posterior mean: $\delta_i(X) = E[\theta_i \mid X] = E\big[E[\theta_i \mid X, \tau^2] \mid X\big] = E\!\left[\frac{\tau^2}{1 + \tau^2} \,\Big|\, X\right] X_i$.

This is a linear shrinkage estimator, with the Bayes-optimal amount of shrinkage estimated from the data.

The likelihood for $\tau^2$, marginalized over the $\theta_i$: $X_i \mid \tau^2 \overset{i.i.d.}{\sim} N(0, 1 + \tau^2)$, so $\frac{1}{d}\|X\|^2 \sim \frac{1 + \tau^2}{d} \chi^2_d$, with mean $1 + \tau^2$ and variance $\frac{2(1 + \tau^2)^2}{d}$.
Define $\zeta(\tau^2) = \frac{1}{1 + \tau^2}$ (the amount of shrinkage), so $\delta_i(X) = (1 - E[\zeta \mid X])\, X_i$.

$$p(X \mid \zeta) = N_d\!\left(0, \tfrac{1}{\zeta} I_d\right) = \frac{1}{(2\pi/\zeta)^{d/2}}\, e^{-\|X\|^2 \zeta / 2} \propto_\zeta \zeta^{d/2}\, e^{-\zeta \|X\|^2 / 2}.$$

Conjugate prior: $\zeta \sim \frac{1}{s^2}\chi^2_k = \Gamma\!\left(\frac{k}{2}, \text{scale } \frac{2}{s^2}\right)$, with density $\frac{(s^2/2)^{k/2}}{\Gamma(k/2)}\, \zeta^{k/2 - 1} e^{-s^2 \zeta / 2}$.[2] Then
$$\lambda(\zeta \mid \|X\|^2) \propto_\zeta \zeta^{\frac{k+d}{2} - 1}\, e^{-(s^2 + \|X\|^2)\zeta/2}, \quad \text{i.e.} \quad \zeta \mid \|X\|^2 \sim \frac{\chi^2_{k+d}}{s^2 + \|X\|^2}, \quad E[\zeta \mid \|X\|^2] = \frac{k + d}{s^2 + \|X\|^2}.$$
The prior acts like "pseudo-data" $Y_1, \dots, Y_k$ with $\|Y\|^2 = s^2$. We might want to truncate the prior to $[0, 1]$ if $d$ is small.
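A Monte Carlo sanity check of the conjugate posterior mean, with hypothetical hyperparameters $k, s^2$ and a hypothetical observed $\|X\|^2$: sample $\zeta$ from the Gamma prior, reweight by the likelihood $\zeta^{d/2} e^{-\zeta \|X\|^2/2}$ (self-normalized importance sampling), and compare to the closed form $(k+d)/(s^2 + \|X\|^2)$.

```python
import math
import random

random.seed(1)
k, s2 = 4, 8.0            # hypothetical hyperparameters: prior zeta ~ chi^2_k / s^2
d, x_norm2 = 10, 30.0     # dimension and hypothetical observed ||X||^2

# Prior is Gamma(shape k/2, scale 2/s^2); weight each draw by the likelihood.
n = 200000
num = den = 0.0
for _ in range(n):
    z = random.gammavariate(k / 2.0, 2.0 / s2)
    w = z ** (d / 2.0) * math.exp(-z * x_norm2 / 2.0)
    num += w * z
    den += w

mc_mean = num / den
exact = (k + d) / (s2 + x_norm2)     # = 14 / 38

# The reweighted prior draws recover the conjugate posterior mean.
assert abs(mc_mean - exact) < 0.01
```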

3.2 Graphical Form

![[Pasted image 20241208233548.png|600]]
These are directed graphical models. This implies the distribution factorizes with one factor per vertex of $(V, E)$: $p(z_1, \dots, z_{|V|}) = \prod_{i=1}^{|V|} p_i(z_i \mid z_{pa(i)})$, where $pa(i) = \{j : j \to i\}$. For this model, $p(\tau^2, \theta_1, \dots, \theta_m, x_1, \dots, x_m) = p(\tau^2) \prod_{i=1}^m p(\theta_i \mid \tau^2) \prod_{i=1}^m p(x_i \mid \theta_i)$.
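The vertex-by-vertex factorization can be sketched directly in code: the joint log-density of the hierarchical Gaussian model is a sum of one log-factor per vertex. The Exp(1) hyperprior on $\tau^2$ below is a hypothetical choice for illustration.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_joint(tau2, thetas, xs):
    # One factor per vertex of the DAG:
    lp = -tau2                                                        # log p(tau2), Exp(1) hyperprior (tau2 >= 0)
    lp += sum(log_normal_pdf(t, 0.0, tau2) for t in thetas)           # prod_i p(theta_i | tau2)
    lp += sum(log_normal_pdf(x, t, 1.0) for t, x in zip(thetas, xs))  # prod_i p(x_i | theta_i)
    return lp

val = log_joint(1.0, [0.2, -0.3], [0.5, -0.1])
assert math.isfinite(val) and val < 0.0
```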


  1. For the normal distribution, $\ell(\mu; x) = -\frac{(x - \mu)^2}{2\sigma^2} + \text{const}$, so $\nabla^2 \ell$ is constant. Recall that the Fisher information is minus the expectation of $\nabla^2 \ell$. ↩︎

  2. $k, s^2$ are "hyperparameters" here; the distribution placed on $\zeta$ is the "hyperprior". ↩︎